Supporting data for this practical session are:
Rainfall data from Orlando, Florida, USA, see Florida_rainfall.RData,
Financial data on the CAC40 index, see CAC40.csv,
The Society of Actuaries (SOA) Group Medical Insurance Large
Claims Database from the R package ReIns,
2010 US Census data ranking US cities by population, see census_USA_2010.xlsx.
You may not have the time to work on all these real data examples.
The idea is to allow you to work on a data set that suits your
interests. In each case, it will be important to code the estimators
yourself, and to compare your procedures with those of R packages such
as evir, evd, evt0 and
extRemes. As a preliminary step, you may therefore want to
do the following things:
For a fixed sample of data of size \(n\) and \(1\leq k\leq n-1\), implement the Hill and Weissman estimators. The result should be a pair of vectors of size \(n-1\). As one of the data sets is fairly large, it will be useful to speed the code up as much as you can.
Implement diagnostic tools (QQ-plots…) to judge whether the heavy-tailed assumption makes sense.
Download a few packages. Good choices for this practical session
are evir and evt0. These will help you check whether
your results make sense (and you might find that one of the packages
does something wrong!)
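As a sketch of the first preliminary task, here is one way to vectorize the Hill and Weissman estimators over all values of \(k\) at once (the function names hill.all and weissman.all, and the assumption of a purely positive sample, are ours):

```r
# Vectorized Hill estimates for k = 1, ..., n-1, and the corresponding
# Weissman extrapolated quantiles at level p. Sketch only: assumes all
# observations are positive (drop nonpositive values first otherwise).
hill.all <- function(y) {
  logy <- log(sort(y, decreasing = TRUE))   # decreasing log order statistics
  k <- 1:(length(y) - 1)
  cumsum(logy[k]) / k - logy[k + 1]         # mean of top-k logs minus log threshold
}

weissman.all <- function(y, p) {
  n <- length(y)
  k <- 1:(n - 1)
  ysort <- sort(y, decreasing = TRUE)
  ysort[k + 1] * ((k / n) / (1 - p))^hill.all(y)  # extrapolate from Y_(n-k)
}
```

On a simulated Pareto sample, hill.all should hover around the true extreme value index for moderate k, which is a useful check before turning to the real data.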
The variable of interest is sum_rain_2m_inches,
representing daily total rainfall at a weather station near Orlando,
Florida, USA, during the American storm season, which lasts from August
to October. Accompanying this variable is the day of recording, from 1st
August 1998 to 31st October 2020.
load("~/Documents/GIT/teaching/extreme value theory/data sets for practical sessions/Florida_rainfall.RData")
rainfall.data <- dataset[,-1]
#rainfall.data.plus <- rainfall.data$sum_rain_2m_inches[rainfall.data$sum_rain_2m_inches>0]
#hist(rainfall.data.plus, breaks = 50, probability = TRUE)
Can you find a reasonable statistical model for the whole of the data?
Check that there is evidence of a heavy right tail in this data set.
Calculate an estimate for the extreme value index of the data. Justify your choice of tuning parameters.
Calculate an estimate for the extreme quantiles at level 0.99, 0.995 and 0.999. Do these make sense?
Can you provide a confidence interval for the extreme value index and for these extreme quantiles? If so, why? If not, can you suggest a method that would allow you to do so?
Financial data about stock market indices and equity prices are
typically not stationary. When using this data, isolate first the
Close variable, representing daily closing prices. Then,
construct the daily log-return variable \(X_t\) defined as follows: if \(S_t\) is the closing price on day \(t\), the daily log-return on day \(t\) is \(\log(S_t/S_{t-1})\). This variable \(X_t\) will be the variable of interest,
rather than the nonstationary price \(S_t\). Accompanying this variable is the
date of recording, from 1st March 1990 to 8th July 2021.
cac40.data <- read.csv("~/Documents/GIT/teaching/extreme value theory/data sets for practical sessions/CAC40.csv", header = TRUE)
cac40.data$Close <- as.numeric(cac40.data$Close)
log.close <- data.frame('Date'=cac40.data$Date, 'lclose'=log(cac40.data$Close))
log.close <- na.omit(log.close) #na.omit to get rid of NAs
It's also possible to replace the NAs with the median in a window around the missing value if you want to get fancy.
Log returns can be computed by differencing the logarithms of the prices:
log_return <- data.frame('Date'=log.close[-1,1], 'logret'=diff(log.close[,2]))
# or you can just use
log_return2 <- data.frame('Date'=log.close$Date[-1], 'logret'= log.close[2:nrow(log.close),2]-log.close[1:(nrow(log.close)-1),2])
A first graphical inspection shows very heavy tails:
par(mfrow=c(1,2))
stripchart(log_return$logret, col='red') #don't bother
rug(log_return$logret)
hist(log_return$logret)
rug(log_return$logret, col='red')
We once again see a rather heavy tail.
We now show the time series:
We use the xts (eXtensible Time Series) package, which preserves the time-series information in its native format.
library(xts)
log_return.ts <- as.xts(log_return$logret, as.Date(log_return$Date))
plot(log_return.ts, main="Log Returns")
- Stationarity test
We are dealing with log returns, so stationarity should not be an issue. However, we can still test for it using the ADF and/or KPSS tests.
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -89.7 0.01
## [2,] 1 -64.3 0.01
## [3,] 2 -54.2 0.01
## [4,] 3 -45.4 0.01
## [5,] 4 -42.3 0.01
## [6,] 5 -39.1 0.01
## [7,] 6 -35.3 0.01
## [8,] 7 -32.9 0.01
## [9,] 8 -31.3 0.01
## [10,] 9 -29.7 0.01
## [11,] 10 -28.1 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -89.7 0.01
## [2,] 1 -64.3 0.01
## [3,] 2 -54.2 0.01
## [4,] 3 -45.4 0.01
## [5,] 4 -42.3 0.01
## [6,] 5 -39.1 0.01
## [7,] 6 -35.3 0.01
## [8,] 7 -32.9 0.01
## [9,] 8 -31.3 0.01
## [10,] 9 -29.7 0.01
## [11,] 10 -28.1 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -89.7 0.01
## [2,] 1 -64.3 0.01
## [3,] 2 -54.2 0.01
## [4,] 3 -45.4 0.01
## [5,] 4 -42.3 0.01
## [6,] 5 -39.1 0.01
## [7,] 6 -35.3 0.01
## [8,] 7 -32.9 0.01
## [9,] 8 -31.3 0.01
## [10,] 9 -29.7 0.01
## [11,] 10 -28.1 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
library(urca)
summary(ur.df(log_return$logret, lags=trunc(sqrt(length(log_return$logret))), selectlags = "AIC"))
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression none
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.134613 -0.006484 0.000506 0.007406 0.098843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -1.11500 0.02865 -38.916 < 2e-16 ***
## z.diff.lag1 0.10798 0.02590 4.169 3.10e-05 ***
## z.diff.lag2 0.08989 0.02308 3.895 9.91e-05 ***
## z.diff.lag3 0.04942 0.01973 2.505 0.0123 *
## z.diff.lag4 0.06775 0.01599 4.237 2.29e-05 ***
## z.diff.lag5 0.02338 0.01128 2.073 0.0382 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01375 on 7857 degrees of freedom
## Multiple R-squared: 0.5057, Adjusted R-squared: 0.5053
## F-statistic: 1340 on 6 and 7857 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -38.9162
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.58 -1.95 -1.62
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 11 lags.
##
## Value of test-statistic is: 0.0607
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The results seem unconvincing. Let’s check the squared returns
instead:
There it is: severe autocorrelation!
Remark: this dependence will cause issues with variance/SE estimation.
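The check on the squared returns is just an ACF plot of the series and of its squares. A minimal sketch on simulated data with volatility clustering (an ARCH(1) series standing in for the log returns, purely for illustration):

```r
# ACF of levels vs squares: an ARCH(1) sample shows little autocorrelation
# in the series itself but strong autocorrelation in its squares,
# mimicking the behavior of the log returns.
set.seed(42)
n <- 5000
x <- numeric(n)
for (t in 2:n) {
  sigma2 <- 0.1 + 0.5 * x[t - 1]^2   # conditional variance (ARCH(1))
  x[t] <- sqrt(sigma2) * rnorm(1)
}
par(mfrow = c(1, 2))
acf(x, main = "Series")
acf(x^2, main = "Squared series")
```

With the real data, replace the simulated x by log_return$logret.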
hist(log_return$logret, probability = TRUE, ylim=c(0,max(density(log_return$logret)$y)))
lines(density(log_return$logret), lwd=2, col="red")
rug(log_return$logret)
summary(log_return$logret)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1309835 -0.0065411 0.0004147 0.0001572 0.0073485 0.1059459
It appears bell-shaped, but the tails are too heavy to be Gaussian. We’ll check it anyway.
library(fitdistrplus)
norm.fit <- fitdistrplus::fitdist(log_return$logret, "norm", method = 'mle')
plot(norm.fit)
There is a clear problem in the tail of the distribution. We will try a Student distribution instead.
Fitting a plain Student distribution gives a horrible result; the culprit must be the missing location and scale parameters.
#manual creation of the location-scale Student distribution. Exercise
dt_ls <- function(x, df=1, mu=0, sigma=1) 1/sigma * dt((x - mu)/sigma, df)
pt_ls <- function(q, df=1, mu=0, sigma=1) pt((q - mu)/sigma, df)
qt_ls <- function(p, df=1, mu=0, sigma=1) qt(p, df)*sigma + mu
rt_ls <- function(n, df=1, mu=0, sigma=1) rt(n,df)*sigma + mu
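A quick sanity check for these location-scale functions: the density should integrate to one, and qt_ls should invert pt_ls (the parameter values below are arbitrary test values):

```r
# Sanity checks for the location-scale Student functions defined above.
dt_ls <- function(x, df=1, mu=0, sigma=1) 1/sigma * dt((x - mu)/sigma, df)
pt_ls <- function(q, df=1, mu=0, sigma=1) pt((q - mu)/sigma, df)
qt_ls <- function(p, df=1, mu=0, sigma=1) qt(p, df)*sigma + mu

total <- integrate(dt_ls, -Inf, Inf, df = 3, mu = 0.5, sigma = 2)$value
total   # should be very close to 1
pt_ls(qt_ls(0.9, df = 3, mu = 0.5, sigma = 2), df = 3, mu = 0.5, sigma = 2)
        # should give back 0.9
```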
t.fit2<- fitdistrplus::fitdist(log_return$logret, 't_ls', start =list(df=1,mu=mean(log_return$logret),sigma=sd(log_return$logret)))
t.fit2
## Fitting of the distribution ' t_ls ' by maximum likelihood
## Parameters:
## estimate Std. Error
## df 3.6066698101 0.1588165266
## mu 0.0004560405 0.0001268895
## sigma 0.0094429860 0.0001335163
The result is much more satisfactory.
There is obviously an automatic version:
library(metRology)
t.fit3 <- fitdistrplus::fitdist(log_return$logret, "t.scaled", start=list(df=3,mean=mean(log_return$logret),sd=sd(log_return$logret)))
plot(t.fit3) #identical results to t.fit2
## Fitting of the distribution ' t.scaled ' by maximum likelihood
## Parameters :
## estimate Std. Error
## df 3.6071226283 0.1589293320
## mean 0.0004540197 0.0001268762
## sd 0.0094414481 0.0001335199
## Loglikelihood: 23469.55 AIC: -46933.1 BIC: -46912.16
## Correlation matrix:
## df mean sd
## df 1.00000000 -0.04056638 0.67096304
## mean -0.04056638 1.00000000 -0.03741193
## sd 0.67096304 -0.03741193 1.00000000
The visual inspection done earlier strongly suggests that the data is heavy-tailed. To confirm this, use the moment, generalized Hill, or MLE estimators of the extreme value index, and check whether you end up with positive values.
library(ReIns)
mom<-Moment(log_return$logret[log_return$logret>0])
plot(mom$gamma[1:1000], type='l', lwd=1, ylim=c(0,0.4))
The extreme value index is clearly positive, which confirms that the
data is heavy-tailed.
Since the data is heavy-tailed, we use the Hill estimator to find reasonable values of the extreme value index.
k=1:1000
alpha<-0.05
library(evt0) #trusted package (you can use ReIns as well)
evi <-mop(log_return$logret, k=k, p=0) #p=0 for Hill estimator (see Mean of order p (mop) statistic for the extreme value index)
plot(k,evi$EVI, type='l', lty=1)
#confidence intervals (these are meaningless in the context of this data)
qcrit<-qnorm(1-alpha/2)
u<-evi$EVI+qcrit*evi$EVI/sqrt(k) #confidence intervals
l<-evi$EVI-qcrit*evi$EVI/sqrt(k) #however, they're meaningless because of dependence
lines(k,u,lty=3,col='red')
lines(k,l,lty=3,col='red')
library(evir)
evi.evir <- hill(log_return$logret, start=2, end=1000, option = "xi")
lines(2:1000, evi$EVI[2:1000], col="blue")
Trouble signs: evir has issues with its Hill code, so don't use it!
Stick to mop from evt0, or Hill from ReIns
or write it yourself:
hillmanual <- function(y,k){
n=length(y)
ysort=sort(y)
som1=log(ysort[n])
evi<-som1-log(ysort[n-1])
for (i in 1:k){
som1 = som1 + (log(ysort[n-i]))
ev = som1/(i+1)-log(ysort[n-i-1])
evi<-c(evi,ev)
}
return(evi)
}
evimanual<-hillmanual(log_return$logret, 999)
plot(k, evimanual, ylim=c(0,0.5), type='l', lwd=2)
lines(1:1000,evi$EVI[1:1000], col='red', lty=3, lwd=2)
A perfect match with mop!
hillmanualfast <- function(y,k){
  ysort <- sort(y, decreasing = T)
  ysort <- ysort[ysort>0]            #keep positive observations only
  logy <- log(ysort)[1:(k+1)]        #top k+1 log order statistics
  cumlogy <- cumsum(logy[1:k])/(1:k) #running means of the top logs
  xi <- cumlogy - logy[2:(k+1)]      #Hill estimates for k=1,...,k
  return(xi)
}
evimanualfast <- hillmanualfast(log_return$logret, k=1000)
plot(evimanualfast, type='l', lwd=2)
lines(1:1000,evi$EVI[1:1000], col='red', lty=2, lwd=2) #perfect match with mop!
Timing the three implementations gives:
## user system elapsed
## 0.001 0.000 0.000
## user system elapsed
## 0.001 0.000 0.001
## user system elapsed
## 0.001 0.000 0.001
Quite similar in all three cases, though the code with loops is a bit slower (results will vary depending on machine specs).
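A self-contained sketch of such a timing comparison on simulated data (the two functions below mirror hillmanual and hillmanualfast; the sample and the value of k are our own choices):

```r
# Loop-based vs vectorized (cumsum) Hill estimator: both give the same
# values, and the timings can be compared with system.time.
set.seed(1)
y <- runif(1e5)^(-1)   # unit Pareto sample, extreme value index 1
k <- 5000

hill_loop <- function(y, k) {
  logy <- log(sort(y, decreasing = TRUE))
  evi <- numeric(k)
  s <- 0
  for (i in 1:k) {
    s <- s + logy[i]
    evi[i] <- s/i - logy[i + 1]    # Hill estimate at this i
  }
  evi
}

hill_vec <- function(y, k) {
  logy <- log(sort(y, decreasing = TRUE))[1:(k + 1)]
  cumsum(logy[1:k])/(1:k) - logy[2:(k + 1)]
}

print(system.time(h1 <- hill_loop(y, k)))
print(system.time(h2 <- hill_vec(y, k)))
all.equal(h1, h2)   # the two versions agree
```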
To extrapolate, we need to choose a specific value of the Hill estimator, which corresponds to choosing the number k of top order statistics.
x <- log_return$logret #the sample of interest (definition reconstructed from context)
n <- length(x)
gam<-evimanualfast
plot(gam, type='l', lwd=2) #check for stable region, then pick the value of k at the end of it
#250-400 looks fairly stable, so we choose k=400
abline(v=400) #check choice of k using exponential QQ
kn<-400
sorted<-sort(log(x[x>0]), decreasing = T) #keep positive values so the log is well defined
exc<-(sorted-sorted[kn])[1:kn] #these log-exceedances should be approximately exponentially distributed with mean equal to the extreme value index
any(is.na(exc)) #a quick way to check for NAs in your data
## [1] FALSE
The result looks good!
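The exponential QQ-plot itself can be drawn as follows (a sketch on simulated exponential exceedances; with the real data, pass the exc vector computed above — the function name exp_qqplot is ours):

```r
# Exponential QQ-plot: ordered log-exceedances against standard exponential
# quantiles. An approximately straight line through the origin supports the
# heavy-tail assumption; its slope estimates the extreme value index.
exp_qqplot <- function(exc) {
  kn <- length(exc)
  theo <- qexp((1:kn)/(kn + 1))    # theoretical exponential quantiles
  plot(theo, sort(exc), xlab = "Exponential quantiles",
       ylab = "Ordered log-exceedances")
  fit <- lm(sort(exc) ~ theo - 1)  # regression line through the origin
  abline(fit, col = "red")
  coef(fit)                        # rough slope-based estimate of the EVI
}

# Illustration on simulated exponential exceedances with mean 0.3:
set.seed(1)
exp_qqplot(rexp(400, rate = 1/0.3))
```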
#we now have our tail index estimate
gamhat<-gam[kn]
#0.995 extrapolation (0.995 might seem extreme, but the sample size is quite large)
ext.quantile <-quantile(x,1-kn/n, type=3)*((1-0.995)/(kn/n))^{-gamhat} #type=3 is needed for order statistics
empirical.quantile <- quantile(x,0.995)
c(ext.quantile, empirical.quantile)
## 94.97045% 99.5%
## 0.04421507 0.04247024
Extrapolated and empirical quantiles are very close.
taunp <- 1-1/n #extreme level 1-1/n (definition reconstructed from context)
taunp
## [1] 0.9998743
ext.quantile.t<-quantile(x,1-kn/n, type=3)*((1-taunp)/(kn/n))^{-gamhat}
empirical.quantile.t <- quantile(x,taunp)
c(ext.quantile.t, empirical.quantile.t)
## 94.97045% 99.98743%
## 0.15015826 0.09617054
Extrapolation goes way past the empirical quantile, as expected.
For confidence intervals, we already know they're meaningless because of the dependence in the data. Still, here is a quick way to calculate these (meaningless) intervals:
c1 <-ext.quantile.t * (1-qnorm(1-alpha/2)*log((kn/n)/(1-taunp))*sqrt(1/kn)*gamhat)
c2 <-ext.quantile.t * (1+qnorm(1-alpha/2)*log((kn/n)/(1-taunp))*sqrt(1/kn)*gamhat)
c(c1,ext.quantile.t,c2)
## 94.97045% 94.97045% 94.97045%
## 0.1208903 0.1501583 0.1794263
You can also plot the extrapolated quantiles and their CIs against \(k\):
ext<-quantile(x,1-k/n, type=3)*((1-taunp)/(k/n))^{-gam} #make use of the vectorization, no loops
c1 <-ext * (1-qnorm(1-alpha/2)*log((k/n)/(1-taunp))*sqrt(1/k)*gam)
c2 <-ext * (1+qnorm(1-alpha/2)*log((k/n)/(1-taunp))*sqrt(1/k)*gam)
plot(k,ext, type = 'l', ylim=c(0,0.3))
lines(k,c1,col='red',lty=2)
lines(k,c2,col='red',lty=2)
abline(v=kn)
Remark: to generate meaningful confidence intervals,
we need to account for the dependence. Prof. Stupfler's package
ExtremeRisks has some nice tools for that.
The only variable provided in this example is the amount of money
related to a medical claim exceeding $25,000 in 1991 in the USA. This
data set can be loaded by typing data(soa) after having
loaded the ReIns package.
Represent the data. Can you find a reasonable statistical model for the whole of the data?
Check that there is evidence of a heavy right tail in this data set.
Calculate an estimate for the extreme value index of the data. Justify your choice of tuning parameters.
Calculate an estimate for the extreme quantiles at level 0.995 and \(1 − 1/n\), where \(n\) denotes the sample size. Do these make sense? Do you think that the level 0.995 can be considered extreme here?
Can you provide a confidence interval for the extreme value index and for these extreme quantiles? If so, why? If not, can you suggest a method that would allow you to do so?
The variable of interest is the 2010 population size in US cities having more than 50,000 inhabitants. Other variables provided are population projections in subsequent years.
#library("readxl")
census <- read.csv("~/Documents/GIT/teaching/extreme value theory/data sets for practical sessions/census_USA_2010.csv", skip = 3, sep = ";")
census$Census <- as.numeric(census$Census)
census.2010 <- census$Census
Represent the data. Can you find a reasonable statistical model for the whole of the data?
Check that there is evidence of a heavy right tail in this data set.
Calculate an estimate for the extreme value index of the data. Justify your choice of tuning parameters.
Calculate an estimate for the extreme quantiles at level 0.99, 0.995 and 0.999. Do these make sense? Can you provide an interpretation for such quantiles?
Can you provide a confidence interval for the extreme value index and for these extreme quantiles? If so, why? If not, can you suggest a method that would allow you to do so?